Programs for "Collaborative Production in Science: An Empirical Analysis of Coauthorships in Economics"
Katharine A. Anderson
Seth Richards-Shubik

This analysis requires R, MATLAB, and KNITRO, on a machine with 500 GB or more of RAM. In brief, run 'master_c1.R' (or 'master_c2.R') in R, and then run 'master_c1.m' (or 'master_c2.m') in MATLAB.

The programs in R (called by 'master_c1.R' or 'master_c2.R') create the analytic files. The two master programs correspond to the two cost functions (c1 and c2). Similarly, all programs ending with '_c1.R' or '_c2.R' have different versions depending on the cost function.

There are three global parameters in the master programs in R: 'L', 'K', and 'samp'. 'samp' controls whether the full population or a 50% sample is used. 'K' is the threshold for the number of prior projects, used to define experienced researchers, and 'L' is the maximum number of current projects. The default for 'K' is 2, and values of 1 or 3 are used to assess the sensitivity of the characteristics of experienced researchers to this threshold. 'L' is not adjusted for the analysis presented in the paper, although the programs should be able to accommodate an increase in 'L' to 4. This may be computationally feasible for cost function c2.

The programs in MATLAB (called by 'master_c1.m' or 'master_c2.m') generate the recovered sets of structural parameters via an MCMC procedure, as described in Appendix A.5 in the supplementary material. As with the programs in R, the programs ending with '_c1.m' or '_c2.m' have different versions depending on the cost function.

There is one global parameter in the master programs for MATLAB: 'samp', which indicates whether the full population or a 50% sample is used. This must be set equal to the 'samp' variable in the master program run in R.

The MATLAB programs output a .mat file containing a sequence of structural parameter vectors (stored as a matrix, 'parameters') and their associated log pseudo-densities ('densities'). A vector is considered to be in the recovered set if its log pseudo-density is zero.
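The membership rule above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not part of the replication package. It assumes that 'parameters' holds one parameter vector per row and that 'densities' holds the matching log pseudo-densities in the same order; the function name, the 'tol' argument, and the numbers are hypothetical.

```python
def recovered_set(parameters, densities, tol=0.0):
    """Keep the parameter vectors whose log pseudo-density is zero.

    Assumption: parameters[i] is the vector that goes with densities[i].
    tol is a hypothetical tolerance; 0.0 means an exact match with zero.
    """
    return [theta for theta, d in zip(parameters, densities) if abs(d) <= tol]


# Made-up values for illustration only (not results from the paper).
parameters = [[0.10, 1.5], [0.20, 1.4], [0.30, 1.1], [0.25, 1.3]]
densities = [0.0, -3.2, 0.0, -0.7]

recovered = recovered_set(parameters, densities)
print(recovered)
```

Only the first and third vectors are kept here, because they are the only ones with a log pseudo-density of zero.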

The master programs generate a sequence of 100 parameter vectors by default, but this can be adjusted based on desired run times. To generate a recovered set efficiently, it is best to run multiple chains with different starting values. For example, start new chains from points recovered in a previous chain that lie at different locations in the recovered set. The MCMC procedure is embarrassingly parallel across chains, so multiple chains can be run at once, to the extent your hardware permits.
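One possible way to pick starting values at different locations in a previously recovered set is a greedy farthest-point selection, sketched below in Python for illustration. This heuristic is an assumption and is not taken from the replication programs; the function name and the example points are hypothetical.

```python
import math


def spread_out_starts(points, n_starts):
    """Greedily pick up to n_starts points that are far apart from each other.

    Starts from the first point, then repeatedly adds the point whose
    distance to its nearest already-chosen point is largest.
    """
    if not points or n_starts <= 0:
        return []
    chosen = [0]  # indices into points
    while len(chosen) < min(n_starts, len(points)):
        remaining = [i for i in range(len(points)) if i not in chosen]
        nxt = max(
            remaining,
            key=lambda i: min(math.dist(points[i], points[j]) for j in chosen),
        )
        chosen.append(nxt)
    return [points[i] for i in chosen]


# Made-up recovered points for illustration only.
previous_recovered = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [0.0, 1.0]]
starts = spread_out_starts(previous_recovered, 3)
print(starts)
```

Each selected point could then serve as the starting value of a separate chain. The point [0.1, 0.0] is skipped in this example because it lies close to the first starting value.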

The programs 'plot_results_c1.R' and 'plot_results_c2.R' summarize the results. The code at the beginning can easily be adapted to load additional .mat files with the results from further runs of the MCMC procedure. After importing the results, these programs create the sub-plots that go into Figures 3 and A2. They also produce the numbers for Table 4.

The program 'descriptives.R' generates the tables of descriptive statistics. It uses the file 'analytic_c2_pop.RData' (analytic file with cost function c2, using the full population).

For questions, please contact the corresponding author: Seth Richards-Shubik <sethrs@lehigh.edu>
